This report details the research and development work done on MLC++ under ONR
grant N00014-94-1-0448.

1  Overview  of  MLC++

MLC++ is a Machine Learning library of C++ classes. General information about the library
can be obtained through the World Wide Web at URL

http://robotics.stanford.edu:/users/ronnyk/mlc.html

The current implementation supports supervised learning of concepts using decision trees,
decision graphs, nearest-neighbor (instance-based), and probabilities (Naive-Bayes).
Algorithms for feature subset selection and discretization can work with any of the induction
algorithms.

MLC++ object code for Sun is available through the World Wide Web. Over 150 different
sites have copied the MLC++ kit, and machine learning research in the robotics lab at
Stanford is enhanced through the use of the library. All the algorithms in Ron Kohavi's
dissertation, for example, are implemented in MLC++.


2  Summary  of  Results 

As  detailed  in  the  statement  of  work  for  the  grant,  three  main  projects  were  proposed: 

1.  Search  algorithms. 

2.  General  Logic  Diagrams  (GLDs). 

3.  Data  manipulation  routines. 

We now describe the specific work done and the results obtained.

2.1  Search  algorithms


Hill  climbing  and  best-first  search  were  implemented  as  general  search  algorithms.  Attempts 
to  use  the  search  techniques  for  finding  small  decision  trees,  as  originally  envisioned,  did  not 
result  in  significant  performance  improvements;  however,  the  algorithms  were  then  used  for 
a  different  purpose,  feature  subset  selection,  and  important  research  results  were  obtained. 

In John, Kohavi & Pfleger (1994), the wrapper approach to feature subset selection was
proposed. The problem of feature subset selection is that of finding features that are relevant
to the supervised task at hand. Feature subset selection has been studied for many years in
statistics, pattern recognition, and machine learning; however, most suggestions were based
on a filter approach where the data alone determined what features are important, thus
ignoring the induction algorithm. The proposed approach uses the induction algorithm as
a black box and tests its performance on different feature subsets to determine the best
set of features for future predictions. In Kohavi (1994a), the problem was generalized and
abstracted into a search with probabilistic estimates. Best-first search was used and was
shown to be superior to hill climbing.
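
The two search strategies can be illustrated with a small sketch. It is not the MLC++
implementation: the names Subset, Evaluator, hillClimb, bestFirst, and toyAccuracy are
hypothetical, the search starts from the empty subset and only adds features, and the
stopping rule for best-first search (a fixed number of consecutive non-improving
expansions, maxStale) is an assumption. The evaluator stands in for the black box; in
practice it would return an accuracy estimate for the induction algorithm, for example
from cross-validation.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_set>
#include <utility>

// A feature subset as a bitmask: bit i set means feature i is included.
using Subset = std::uint32_t;
// Black box: estimated accuracy of the induction algorithm on a subset.
using Evaluator = std::function<double(Subset)>;

// Hill climbing: move to the best single-feature addition and stop as
// soon as no addition strictly improves the estimate.
Subset hillClimb(int numFeatures, const Evaluator& eval) {
    assert(numFeatures > 0 && numFeatures < 32);
    Subset current = 0;
    double currentScore = eval(current);
    for (;;) {
        Subset bestChild = current;
        double bestScore = currentScore;
        for (int f = 0; f < numFeatures; ++f) {
            Subset child = current | (Subset{1} << f);
            if (child == current) continue;  // feature already present
            double s = eval(child);
            if (s > bestScore) { bestScore = s; bestChild = child; }
        }
        if (bestChild == current) return current;  // local maximum
        current = bestChild;
        currentScore = bestScore;
    }
}

// Best-first search: always expand the best open node, so the search can
// pass through nodes that do not improve the estimate. Stops after
// maxStale consecutive expansions without a new best subset.
Subset bestFirst(int numFeatures, const Evaluator& eval, int maxStale) {
    assert(numFeatures > 0 && numFeatures < 32 && maxStale > 0);
    std::priority_queue<std::pair<double, Subset>> open;  // max-heap on score
    std::unordered_set<Subset> seen;
    Subset best = 0;
    double bestScore = eval(best);
    open.push(std::make_pair(bestScore, best));
    seen.insert(best);
    int stale = 0;
    while (!open.empty() && stale < maxStale) {
        Subset node = open.top().second;
        open.pop();
        bool improved = false;
        for (int f = 0; f < numFeatures; ++f) {
            Subset child = node | (Subset{1} << f);
            if (!seen.insert(child).second) continue;  // already generated
            double s = eval(child);
            open.push(std::make_pair(s, child));
            if (s > bestScore) { bestScore = s; best = child; improved = true; }
        }
        stale = improved ? 0 : stale + 1;
    }
    return best;
}

// Hypothetical evaluator: features 0 and 1 help only when used together,
// and feature 2 is irrelevant and slightly harmful.
double toyAccuracy(Subset s) {
    double acc = ((s & 3u) == 3u) ? 0.9 : 0.5;
    if (s & 4u) acc -= 0.01;
    return acc;
}
```

With toyAccuracy, no single feature improves on the empty subset, so hillClimb(3,
toyAccuracy) stops immediately, whereas bestFirst(3, toyAccuracy, 5) continues through
the plateau and reaches the subset containing features 0 and 1. This is one constructed
case, not a reproduction of the experiments in Kohavi (1994a).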


The work on feature subset selection concentrated on decision trees as the underlying
hypothesis space; ID3 (Quinlan 1986) and C4.5 (Quinlan 1993) were used as the underlying
induction algorithms. An observation was made that very few features were actually chosen
by the algorithm and that most trees were complete, i.e., they tested all the features. This
suggested that much of the inductive power comes from finding a relevant set of features,
not from the actual tree structure that was used. Testing the conjecture using MLC++
was extremely easy; the same day, we had results showing that, indeed, for discrete datasets,
performance of decision tables on features selected by the wrapper approach was comparable
to that of the best induction algorithms. The work was reported in Kohavi & Frasca (1994),
and a more systematic study with a better understanding of the underlying phenomena was
reported in Kohavi (1995a). We believe that this surprising result would never have been
discovered without the power of MLC++. Testing the conjecture without MLC++ would
have required a long time, and it probably would never have been done.
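
To make the notion of a decision table concrete, the following is a minimal sketch of a
majority-vote table over a chosen set of discrete features. It is an illustration under
stated assumptions, not the MLC++ class: the names Instance, DecisionTable, and
toyTrainingSet are hypothetical, and the handling of unseen feature combinations
(falling back to the overall majority class) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Instance {
    std::vector<int> values;  // discrete feature values
    int label;                // class label
};

// Hypothetical decision table: instances are projected onto the selected
// features; each distinct projection is a table cell holding class counts.
class DecisionTable {
public:
    DecisionTable(const std::vector<Instance>& train,
                  const std::vector<std::size_t>& features)
        : features_(features), defaultLabel_(-1) {
        std::map<int, int> overall;
        for (const Instance& inst : train) {
            ++cells_[project(inst.values)][inst.label];
            ++overall[inst.label];
        }
        defaultLabel_ = majority(overall);
    }

    // Majority class of the matching cell; if no training instance matches
    // (assumption), the majority class of the whole training set.
    int predict(const std::vector<int>& values) const {
        auto it = cells_.find(project(values));
        return it == cells_.end() ? defaultLabel_ : majority(it->second);
    }

private:
    std::vector<int> project(const std::vector<int>& values) const {
        std::vector<int> key;
        for (std::size_t f : features_) {
            assert(f < values.size());
            key.push_back(values[f]);
        }
        return key;
    }

    // Most frequent label; ties go to the smallest label, -1 if empty.
    static int majority(const std::map<int, int>& counts) {
        int best = -1;
        int bestCount = 0;
        for (const auto& kv : counts) {
            if (kv.second > bestCount) { best = kv.first; bestCount = kv.second; }
        }
        return best;
    }

    std::vector<std::size_t> features_;
    std::map<std::vector<int>, std::map<int, int>> cells_;
    int defaultLabel_;
};

// Hypothetical data: the label is the XOR of features 0 and 1; feature 2
// is noise.
std::vector<Instance> toyTrainingSet() {
    std::vector<Instance> data;
    data.push_back(Instance{{0, 0, 0}, 0});
    data.push_back(Instance{{0, 1, 1}, 1});
    data.push_back(Instance{{1, 0, 0}, 1});
    data.push_back(Instance{{1, 1, 1}, 0});
    data.push_back(Instance{{0, 1, 0}, 1});
    return data;
}
```

A table built on features 0 and 1 of the toy data classifies every combination from a
matching cell, while a table built on all three features has to fall back to the
default class for combinations it has not seen. That difference is the reason the
choice of features matters so much for this representation.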

Recent  work  on  feature  subset  selection  using  dynamic  operators  for  the  search  space  and 
the  use  of  other  induction  algorithms  was  reported  in  Kohavi  &  Sommerfield  (1995)  together 
with  a  discussion  on  overfitting  in  feature  subset  space. 

Another use for the wrapper approach is that of parameter tuning: given an algorithm
with different possible settings, how can one find a good setting for the task? Kohavi & John
(1995) reported significant improvements to C4.5 (Quinlan 1993) when these parameters
were tuned automatically using the wrapper approach.


2.2  General  Logic  Diagrams 


General Logic Diagrams, or GLDs, were originally proposed by Michalski (1978).
The diagrams allow viewing multi-dimensional discrete spaces and can help researchers gain
insight into the induced concept by inspecting it.
GLDs were implemented in MLC++ and were used for illustrative purposes in Kohavi
(1994b).


2.3  Data  manipulation 

Data conversions to local and binary encodings were implemented. Three algorithms for
discretization of continuous features were implemented: uniform binning, the 1R
discretization proposed in Holte (1993), and the entropy-based discretization proposed in
Fayyad & Irani (1993) and Catlett (1991). The methods were compared in Dougherty,
Kohavi & Sahami (1995). The Naive-Bayes algorithm (Langley, Iba & Thompson 1992) was
shown to dramatically improve in accuracy after discretization.
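
Of the three methods, uniform binning is the simplest, and a short sketch may help fix
the idea. The version below assumes equal-width bins between the observed minimum and
maximum of a feature; the function names uniformCutPoints and binIndex are hypothetical
and are not the MLC++ interface. The 1R and entropy-based methods choose cut points
using the class labels and are not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Equal-width binning (assumption): split [min, max] of the observed
// values into k intervals and return the k - 1 interior cut points.
std::vector<double> uniformCutPoints(const std::vector<double>& values, int k) {
    assert(!values.empty() && k >= 1);
    auto mm = std::minmax_element(values.begin(), values.end());
    double lo = *mm.first;
    double hi = *mm.second;
    double width = (hi - lo) / k;
    std::vector<double> cuts;
    for (int i = 1; i < k; ++i) {
        cuts.push_back(lo + i * width);
    }
    return cuts;
}

// Bin index of x: the number of cut points that are <= x. A value equal
// to a cut point falls in the upper bin; values outside the training
// range fall in the first or last bin.
int binIndex(const std::vector<double>& cuts, double x) {
    return static_cast<int>(
        std::upper_bound(cuts.begin(), cuts.end(), x) - cuts.begin());
}
```

For example, with the values 0, 10, 2.5, and 7.5 and k = 4, the cut points are 2.5,
5, and 7.5, so a value of 6 falls in bin 2 (counting from 0). The resulting bin index
can then be treated as a discrete feature value by an algorithm such as Naive-Bayes.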

2.4  Related  Projects 

The ONR grant was acknowledged in papers that were not directly related to the grant, but
which nonetheless indirectly profited from the supported work (Kohavi 1995b, Kohavi & Li
1995, Kohavi, John, Long, Manley & Pfleger 1994).


3  Summary 

MLC++ has been extremely helpful in our research and is currently helping other researchers
in  comparing  different  algorithms  for  given  datasets.  Work  on  the  library  is  continuing  in 
an  effort  to  improve  the  quality  and  enlarge  the  number  of  useful  tools  we  can  provide. 

The main research contribution was the work on feature subset selection. The proposed
wrapper approach was very successful and has already been used by other researchers (Langley
& Sage 1994, Aha & Bankert 1994b, Aha & Bankert 1994a, Mladenic 1995). The
implementation of the different discretization algorithms has led to a better understanding of the
methods. In some cases (most notably, Naive-Bayes), performance using the discretized data
is significantly better, surpassing that of the best known algorithms for many datasets. The
implementation of general logic diagrams provides researchers with another tool for viewing
data.